Abstract

This paper proposes an Approximate n-gram Markov Model for bag generation.
Directed word association pairs with distances are used to approximate the (n-1)-gram and
n-gram training tables. The model has the parameters of a word association model and
the merits of both the word association model and the Markov Model. The training knowledge
for bag generation can also be applied to lexical selection in machine translation design.

1. Introduction 

Natural language generation (Zock and Sabah, 1988; Dale, Mellish and Zock, 1990)
is an important component of many natural language applications, e.g., man-machine
interfaces, automatic translation and text generation. Bag generation (Brown,
Cocke, et al., 1990) is one natural language generation method. Given a sentence,
we cut it up into words, place these words in a bag and try to recover the sentence
from the bag. In the corpus-based approach (Church and Mercer, 1993), a language
model is needed to score the candidate arrangements. The Markov Model (Kuhn
and Mori, 1990) and the word association model (Church and Hanks, 1990) are two
well-known language models. The Markov Model preserves the
linear precedence relations in the context, so it is useful for bag
generation. However, a high-order Markov Model has a tremendous number of parameters.
The word association model can capture long-distance dependency relations in the
context, provided that the window size is the length of the sentence. Thus, it is
useful for applications such as lexical selection. This paper proposes an
Approximate Markov Model, which has the merits of both models.

2. Approximate Markov Model 

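Let $S = \langle *, w_1, w_2, \ldots, w_m, * \rangle$ be an arrangement in bag generation. The star symbol
marks the beginning ($w_0$) and the ending ($w_{m+1}$) of the sentence. The probability of S
in the trigram Markov Model is measured as follows:

$$
\begin{aligned}
P(S) &= P(\langle *, w_1, w_2, \ldots, w_m, * \rangle) \\
     &= P(*) \cdot P(w_1 \mid *) \cdot \prod_{i=0}^{m-1} P(w_{i+2} \mid w_i, w_{i+1}) \\
     &= \frac{\prod_{i=0}^{m-1} P(w_i, w_{i+1}, w_{i+2})}{\prod_{i=1}^{m-1} P(w_i, w_{i+1})}
\end{aligned}
$$
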

This formula uses the trigram training table (numerator part) and the bigram training table
(denominator part) to compute the probability of an arrangement. It can be
approximated by the following formula:



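$$
P(S) \approx \frac{\prod_{i=0}^{m-1} \mathrm{Min}\big(P(w_i, w_{i+1}, 1),\ P(w_{i+1}, w_{i+2}, 1),\ P(w_i, w_{i+2}, 2)\big)}{\prod_{i=1}^{m-1} P(w_i, w_{i+1}, 1)} \tag{1}
$$

where Min denotes the minimum function, and
$P(w_i, w_j, j-i)$ is the probability of a directed word pair $(w_i, w_j)$ whose
distance is $j-i$, e.g., $(w_i, w_{i+1}, 1)$ denotes that $w_i$ is followed by $w_{i+1}$.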

By the notation of directed word pairs with distance, the statement "$w_{i+2}$ follows $w_{i+1}$
and $w_{i+1}$ follows $w_i$" (hereafter, $\langle w_i, w_{i+1}, w_{i+2} \rangle$) is represented as $(w_i, w_{i+1}, 1)$,
$(w_{i+1}, w_{i+2}, 1)$ and $(w_i, w_{i+2}, 2)$. Consider the following figure. Assume parts (i), (ii)
and (iii) correspond to the probabilities of $(w_i, w_{i+1}, 1)$, $(w_{i+1}, w_{i+2}, 1)$ and $(w_i, w_{i+2}, 2)$,
respectively. In this way, part (iv) denotes the probability of $\langle w_i, w_{i+1}, w_{i+2} \rangle$. From
this figure, we know $P(w_i, w_{i+1}, w_{i+2}) \le P(w_i, w_{i+1}, 1)$, $P(w_i, w_{i+1}, w_{i+2}) \le
P(w_{i+1}, w_{i+2}, 1)$ and $P(w_i, w_{i+1}, w_{i+2}) \le P(w_i, w_{i+2}, 2)$. Thus, the minimum of
$P(w_i, w_{i+1}, 1)$, $P(w_{i+1}, w_{i+2}, 1)$ and $P(w_i, w_{i+2}, 2)$ can be used to approximate
$P(w_i, w_{i+1}, w_{i+2})$.
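
The bound behind this approximation can be checked on a small example. The sketch below is only an illustration: the toy corpus is hypothetical, and raw counts are used in place of probabilities on the assumption that all tables share one normalization.

```python
from collections import Counter

# Hypothetical toy corpus; counts stand in for unnormalised probabilities.
corpus = [["a", "b", "c"], ["a", "b", "d"], ["x", "b", "c"], ["a", "y", "c"]]

pair_count = Counter()     # (w_i, w_j, j - i) -> count
trigram_count = Counter()  # (w_i, w_{i+1}, w_{i+2}) -> count
for sentence in corpus:
    for i in range(len(sentence)):
        for j in range(i + 1, len(sentence)):
            pair_count[(sentence[i], sentence[j], j - i)] += 1
    for i in range(len(sentence) - 2):
        trigram_count[tuple(sentence[i:i + 3])] += 1


def min_pair_count(w0, w1, w2):
    """Minimum of the three directed-pair counts that describe <w0, w1, w2>."""
    return min(pair_count[(w0, w1, 1)],
               pair_count[(w1, w2, 1)],
               pair_count[(w0, w2, 2)])


# A trigram count never exceeds the minimum of its three pair counts.
for trigram, count in trigram_count.items():
    assert count <= min_pair_count(*trigram)
```

In this toy corpus the trigram (a, b, c) occurs once while its minimum pair count is 2, so the minimum is an upper bound rather than an exact value.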




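The model formulated by (1) is called the Approximate trigram Markov Model. Similarly,
the following n-gram Markov Model, where $w_a^b$ abbreviates the word sequence $w_a, \ldots, w_b$:

$$
\begin{aligned}
P(S) &= P(\langle *, w_1, w_2, \ldots, w_m, * \rangle) \\
     &= P(*) \cdot P(w_1 \mid *) \cdot P(w_2 \mid *, w_1) \cdots P(w_{n-2} \mid w_0^{n-3}) \cdot \prod_{i=0}^{m-n+2} P(w_{i+n-1} \mid w_i^{i+n-2}) \\
     &= \frac{\prod_{i=0}^{m-n+2} P(w_i^{i+n-1})}{\prod_{i=1}^{m-n+2} P(w_i^{i+n-2})}
\end{aligned}
$$
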

can be approximated by: 

$$
P(S) \approx \frac{\prod_{k=0}^{m-n+2} \mathrm{Min}_{\,k \le i < j \le n+k-1}\ P(w_i, w_j, j-i)}{\prod_{k=1}^{m-n+2} \mathrm{Min}_{\,k \le i < j \le n+k-2}\ P(w_i, w_j, j-i)} \tag{2}
$$
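
Formula (2) can be sketched in Python as follows. This is a minimal illustration, not the original implementation: the names `window_min` and `approx_ngram_log_score`, the dictionary `pair_prob` keyed by (w_i, w_j, distance), and the `floor` value used for unseen pairs are all assumptions. The sketch works in log space and requires n ≥ 3, since the denominator window of n-1 words must contain at least one pair.

```python
import math


def window_min(seq, pair_prob, start, width, floor=1e-12):
    """Minimum of P(w_i, w_j, j - i) over all pairs inside seq[start:start + width]."""
    best = math.inf
    for i in range(start, start + width):
        for j in range(i + 1, start + width):
            best = min(best, pair_prob.get((seq[i], seq[j], j - i), floor))
    return best


def approx_ngram_log_score(words, pair_prob, n, boundary="*", floor=1e-12):
    """Log of Formula (2) for one arrangement; `floor` is an assumed smoothing value."""
    if n < 3:
        raise ValueError("this sketch needs n >= 3")
    seq = [boundary] + list(words) + [boundary]   # w_0 .. w_{m+1}
    m = len(words)
    score = 0.0
    for k in range(0, m - n + 3):                 # numerator: k = 0 .. m-n+2, n words
        score += math.log(window_min(seq, pair_prob, k, n, floor))
    for k in range(1, m - n + 3):                 # denominator: k = 1 .. m-n+2, n-1 words
        score -= math.log(window_min(seq, pair_prob, k, n - 1, floor))
    return score


# Hypothetical pair table for the arrangement <*, a, b, *>.
toy_pairs = {("*", "a", 1): 0.5, ("*", "b", 2): 0.4, ("a", "b", 1): 0.2,
             ("a", "*", 2): 0.3, ("b", "*", 1): 0.1}
toy_score = approx_ngram_log_score(["a", "b"], toy_pairs, n=3)
```

The direct double loop in `window_min` costs O(n^2) per window.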



Formula (2) defines the Approximate n-gram Markov Model. Assume the vocabulary
size is V, and the average sentence length is L. The number of parameters of
the Approximate Markov Model is always $O((L-1) \cdot V^2)$ no matter which order it has.
The Markov bigram and trigram Models have $O(V^2)$ and $O(V^3)$ parameters, respectively.
The number of parameters multiplies by V when the order increases by one. Thus,
the Approximate Markov Model can be used to enlarge the window size when the
number of parameters is a concern.
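
To give a feeling for these orders of magnitude, the short sketch below evaluates both bounds for illustrative values. The figures V = 10,000 and L = 10 are hypothetical and are not taken from the experiments.

```python
def markov_parameters(vocabulary_size, order):
    """Upper bound V^n on the table size of an n-gram Markov Model."""
    return vocabulary_size ** order


def approximate_parameters(vocabulary_size, sentence_length):
    """Upper bound (L - 1) * V^2 on the table size of the Approximate Markov Model."""
    return (sentence_length - 1) * vocabulary_size ** 2


V, L = 10_000, 10   # hypothetical vocabulary size and average sentence length
for order in (2, 3, 4, 5):
    print(f"order {order}: Markov {markov_parameters(V, order):.1e}, "
          f"approximate {approximate_parameters(V, L):.1e}")
```

With these values the approximate model is bounded by 9 x 10^8 parameters at every order, while the trigram Markov Model is already bounded by 10^12.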

3. Bag Generation Algorithm 

The bag generation algorithm under (Approximate) n-gram Markov Model is shown 
below. 



    insert starting node into queue
    while not empty queue do
    begin
        initialize an empty list
        repeat
            remove a node from queue, and assign it to current node
            if current node ≠ final node then
            begin
                expand current node and
                merge to the list if any two paths satisfy all of the following
                conditions:
                (1) the path length should be longer than n-1.
                (2) the lengths of these two paths should be equal.
                (3) the last n-1 nodes on these two paths should be equal.
                (4) these two paths should cover the same words.
            end
            else merge to the list
        until empty queue
        if current node ≠ final node then assign list to queue
    end
    generate the result from list, and check whether it is an error or not.

The merge operation keeps the path with the higher probability and discards the path with
the lower probability. The four conditions in the above algorithm must be met if the
dynamic programming technique is used. The following proposition clarifies this point
for the Markov Model. The proof for the Approximate Markov Model is similar.
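
The level-by-level search and its merge step can be sketched in Python as follows. This is an illustration under stated assumptions, not the original system: `bag_generate`, the incremental scorer `step_logprob(path, word)` and the toy bigram table are hypothetical names and data, and the sketch requires n ≥ 2. Two paths are merged when they are long enough, cover the same words and end in the same n-1 nodes; paths on one level always have equal length.

```python
import math
from collections import Counter


def bag_generate(bag, n, step_logprob, boundary="*"):
    """Return the best arrangement of `bag` found by level-wise search with merging."""
    paths = [(0.0, (boundary,), Counter(bag))]   # (log-probability, path, unused words)
    for _ in range(len(bag)):
        merged = {}
        for score, path, remaining in paths:
            for word in remaining:               # expand with each distinct unused word
                new_path = path + (word,)
                new_score = score + step_logprob(path, word)
                new_remaining = remaining - Counter([word])
                if len(new_path) > n - 1:
                    # same covered words (same leftover bag) and same last n-1 nodes
                    key = (frozenset(new_remaining.items()), new_path[-(n - 1):])
                else:
                    key = new_path               # too short to merge: keep every path
                if key not in merged or new_score > merged[key][0]:
                    merged[key] = (new_score, new_path, new_remaining)
        paths = list(merged.values())
    # Close every surviving path with the end marker and keep the best one.
    best = max(paths, key=lambda p: p[0] + step_logprob(p[1], boundary))
    return list(best[1][1:])


# Hypothetical bigram table; unseen bigrams get an assumed probability of 0.01.
TOY_BIGRAM = {("*", "the"): 0.9, ("the", "dog"): 0.8,
              ("dog", "barks"): 0.7, ("barks", "*"): 0.9}


def toy_step(path, word):
    return math.log(TOY_BIGRAM.get((path[-1], word), 0.01))


result = bag_generate(["barks", "the", "dog"], n=2, step_logprob=toy_step)
# result == ["the", "dog", "barks"]
```

Using the leftover bag as part of the merge key enforces condition (4), since two paths on the same level cover the same words exactly when they leave the same words unused.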

Proposition. The merge operation should obey the following conditions, if n-gram 
Markov Model is adopted: 

(1) The path length should be longer than n-1. 

(2) The lengths of these two paths should be equal. 

(3) The last n-1 nodes on these two paths should be equal. 

(4) These two paths should cover the same words. 
Proof:

The first two are the basic definitions for the n-gram Markov Model. In this model,
the system uses the last n-1 words to predict the probability of the current word.
Let the probabilities of two paths $H_1$ and $H_2$ be $P(H_1)$ and $P(H_2)$, and $P(H_1) > P(H_2)$.
When the next word $w_m$ ($m > n-1$) is read, their probabilities become:

$P(H_1) \cdot P(w_m \mid w_{1(m-n+1)}, \ldots, w_{1(m-1)})$ and
$P(H_2) \cdot P(w_m \mid w_{2(m-n+1)}, \ldots, w_{2(m-1)})$, respectively.

If the last n-1 words are the same, i.e., $w_{1(m-n+1)} = w_{2(m-n+1)}, \ldots, w_{1(m-1)} = w_{2(m-1)}$,
then the former is still larger than the latter. However, if the last n-1 words on these
two paths are not the same, then the former may be smaller than the latter. Thus,
merging may introduce erroneous results.

In fact, the first three conditions are enough for other Markov-based
applications such as phone-to-text transcription. However, a problem arises in
the bag generation application if the last condition is not obeyed as well. Consider a
general case. Let the two paths $H_1$ and $H_2$ have the following forms:

$H_1$: $w_{10}, w_{11}, \ldots, w_{1(m-n)}, w_{(m-n+1)}, \ldots, w_{(m-1)}$ and
$H_2$: $w_{20}, w_{21}, \ldots, w_{2(m-n)}, w_{(m-n+1)}, \ldots, w_{(m-1)}$.

If $\{w_{10}, w_{11}, \ldots, w_{1(m-n)}\}$ is not equal to $\{w_{20}, w_{21}, \ldots, w_{2(m-n)}\}$, there must exist
some $w_{1i}$ and $w_{2j}$ such that $w_{1i} \ne w_{2j}$. If $P(H_1) > P(H_2)$, then the path involving $w_{1i}$, i.e.,

$w_{20}, w_{21}, \ldots, w_{2(m-n)}, w_{(m-n+1)}, \ldots, w_{(m-1)}, w_{1i}$

will be neglected. This path may have a higher probability, so that an error occurs. ■

The cost paid by the Approximate n-gram Markov Model is that each minimal value
in the numerator part and denominator part of Formula (2) is derived from n*(n-1)/2
pairs and (n-1)*(n-2)/2 pairs, respectively. Consider the numerator part. For each
tuple $\langle w_k, w_{k+1}, \ldots, w_{k+n-1} \rangle$ ($0 \le k \le m-n+2$), its probability is determined by
$P(w_i, w_j, j-i)$ ($k \le i < j \le n+k-1$). The complexity of an algorithm that selects the minimum from
n*(n-1)/2 pairs is $O(n^2)$, which is a considerable overhead. Here, a special data structure, i.e., a ring
of n-1 elements, is adopted. Each element records the minimum of the $k+n-1-i$
probabilities $P(w_i, w_{i+p}, p)$ ($1 \le p \le (n-1)-(i-k)$). The index i ranges from k to n+k-2.
The minimum of the n*(n-1)/2 pairs can be computed from these n-1 elements. When
k is increased by one, i.e., the tuple $\langle w_{k+1}, w_{k+2}, \ldots, w_{k+n} \rangle$ is inspected, only these
(n-1) elements are considered instead of n*(n-1)/2 pairs. In other words, the position
in the ring for $P(w_k, w_{k+p}, p)$ ($1 \le p \le n-1$) is free, and is used to record $P(w_{k+n-1},
w_{k+n}, 1)$. The probabilities $P(w_{k+p}, w_{k+n}, n-p)$ ($1 \le p \le n-2$) are compared with the corresponding
elements in the ring. This can be done in O(n) time.
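
The ring update described above can be sketched as follows. This is a minimal illustration, assuming a hypothetical table `pair_prob` keyed by (w_i, w_j, distance) and an assumed `floor` value for unseen pairs; the function name `sliding_window_minima` is likewise not from the original system.

```python
def sliding_window_minima(seq, pair_prob, n, floor=1e-12):
    """Yield, for k = 0, 1, ..., the minimum pair probability inside seq[k:k + n]."""
    def p(i, j):
        return pair_prob.get((seq[i], seq[j], j - i), floor)

    if n < 2 or len(seq) < n:
        return
    size = n - 1
    # Each slot holds the minimum over the pairs that start at one window position.
    ring = [min(p(i, j) for j in range(i + 1, n)) for i in range(size)]
    head = 0                                    # slot of the oldest position
    yield min(ring)
    for new in range(n, len(seq)):              # index of the word entering the window
        oldest = new - n                        # index of the word leaving the window
        for offset in range(1, size):           # one comparison per surviving slot
            slot = (head + offset) % size
            ring[slot] = min(ring[slot], p(oldest + offset, new))
        ring[head] = p(new - 1, new)            # freed slot records the distance-1 pair
        head = (head + 1) % size
        yield min(ring)


# Hypothetical pair table for the arrangement <*, a, b, c, *>.
example_seq = ["*", "a", "b", "c", "*"]
example_pairs = {("*", "a", 1): 0.5, ("a", "b", 1): 0.4, ("*", "b", 2): 0.3,
                 ("b", "c", 1): 0.6, ("a", "c", 2): 0.2, ("c", "*", 1): 0.7,
                 ("b", "*", 2): 0.1}
minima = list(sliding_window_minima(example_seq, example_pairs, n=3))
# minima == [0.3, 0.2, 0.1]
```

Each shift of the window touches every slot once, so the per-window cost is O(n) after the first window has been initialized.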

4. Experimental Results 

The BDC corpus, a segmented Chinese corpus, is adopted as the source of the
training data. It includes 7010 sentences, about 50000 words. For each sentence
$S = \langle *, w_1, w_2, \ldots, w_m, * \rangle$ in the training corpus, a total of (m+1)*(m+2)/2 directed word
association pairs of the form $(w_i, w_j, j-i)$ (where $0 \le i < j \le m+1$) are generated.
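
The pair extraction can be sketched in Python as follows. The function names and the example sentence are hypothetical, and the sketch only collects raw counts; how the counts are turned into probabilities is not specified here and is left open.

```python
from collections import Counter


def directed_pairs(words, boundary="*"):
    """Yield (w_i, w_j, j - i) for every 0 <= i < j <= m + 1 of one sentence."""
    padded = [boundary] + list(words) + [boundary]   # w_0 .. w_{m+1}
    for i in range(len(padded)):
        for j in range(i + 1, len(padded)):
            yield padded[i], padded[j], j - i


def count_pairs(sentences):
    """Accumulate directed word association pair counts over a corpus."""
    counts = Counter()
    for words in sentences:
        counts.update(directed_pairs(words))
    return counts


# A sentence of m = 3 words yields (m+1)*(m+2)/2 = 10 pairs.
example_counts = count_pairs([["a", "b", "c"]])
```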

The experimental results (distribution of error sentences) of bag generation using the
Markov Model and the Approximate Markov Model are shown in the following table. $M_i$
and $AM_i$ denote the i-gram Markov Model and the Approximate i-gram Markov Model, respectively.



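sentence total test |   Markov Model     | Approximate Markov Model
length    sentences |  M2   M3   M4   M5 |  AM2  AM3  AM4  AM5  AMn
--------------------+--------------------+-------------------------
1                 6 |                    |
2                34 |                    |
3               121 |                    |
4               213 |   1                |    1
5               297 |                    |
6               329 |   3                |    3
7               234 |   4                |    4    2    1
8               216 |  11                |   11    1    1
9               183 |   6                |    6
10              170 |   8                |    8
11              129 |  11                |   11
12               68 |  13                |   13    1    1
--------------------+--------------------+-------------------------
total          2000 |  57                |   57    4    3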

It is trivial that AM2 is equal to M2. The other results demonstrate that the power of
the Approximate Markov Model is close to that of the Markov Model.



5. Concluding Remarks 

This paper proposes a directed word association model with distance to approximate
the Markov Model. It can increase the order of the language model while keeping the number of
parameters unchanged. The experimental results show that the performance of
the Approximate Markov Model is very close to that of the Markov Model. Besides, the training
knowledge for bag generation can also be applied to lexical selection. The
co-occurrence of a word pair can be computed easily by summing the related directed word
association pairs. This uniform knowledge facilitates statistics-based machine
translation design.
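
The summation mentioned above can be sketched as follows. This is a minimal illustration, assuming a hypothetical count table keyed by (w_i, w_j, distance) and treating co-occurrence as order-independent, so that both directions and all distances are summed.

```python
from collections import Counter


def cooccurrence(pair_counts, x, y):
    """Sum the counts of all directed pairs that relate x and y, in either order."""
    return sum(count for (a, b, _distance), count in pair_counts.items()
               if {a, b} == {x, y})


# Hypothetical counts of directed word association pairs.
toy_counts = Counter({("a", "b", 1): 2, ("b", "a", 3): 1, ("a", "c", 2): 5})
ab_total = cooccurrence(toy_counts, "a", "b")
# ab_total == 3
```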